Okay, so welcome back everybody and welcome to our lecture on deep reinforcement learning.
So today we want to look a bit into game theory and, in particular, into strategies for learning to play games.
So we actually divide this topic into several blocks.
Before we really go into the concept of reinforcement learning, we first look into the process of sequential decision making.
So this is not yet really reinforcement learning, but it's one step towards the actual game problem.
Here we look essentially into state-free systems: these systems can only perform actions, and you have to choose which action probably produces the highest reward.
And based on that, we will go ahead and expand the system with an additional state, and then we are really in Markov decision processes and go towards reinforcement learning.
And we conclude the lecture with the topic of deep reinforcement learning, where we then really apply the things that we've learned so far in deep learning to that particular topic.
Okay, so let's start with the concept of, as I said, sequential decision making.
And here we essentially have a problem that is also referred to as the multi-armed bandit
problem.
So you can choose from several actions.
You have an action that you can choose at a time t, and there is a set of actions, capital A.
So you could say, for example, if you have these one-armed bandits in the casino, then your action is that you choose one of the machines and pull the lever, and then you get some reward.
But you don't know about the state, and all of them essentially behave the same.
So you just want to choose one specific of those bandits.
So then you take an action, and each action at time t has a different and unknown probability density function that generates some reward r.
So r will be the reward, and what we want to get, of course, is the maximum reward.
We want to win money, or we want to get a high score, or something like that.
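To write this goal down compactly: a standard formulation (this is the common bandit notation, not introduced explicitly in the lecture up to this point) defines the value of an action as its expected reward and then maximizes over the actions:

```latex
% Value q_*(a) of an action a: the expected reward when a is chosen,
% where R_t is the reward and A_t the action at time t.
q_*(a) = \mathbb{E}\left[ R_t \mid A_t = a \right]

% The goal is to pick the action with the highest expected reward:
a^* = \arg\max_{a \in A} q_*(a)
```

The difficulty is, of course, that these expected rewards are unknown and can only be estimated by actually pulling the levers.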
Getting the reward is also very simple in this case, because I pull and I immediately get a reward.
Later, in the games, it will be much more complicated, because the reward may be generated only in the very distant future.
If you consider playing chess, for example, then you only get the reward at the very end, when you win the game.
So that's much more complicated.
Here we get an immediate reward: we pull, and then each of those one-armed bandits will generate some reward.
Depending on which machine you've chosen, you get a different one, and they may have different probability distributions, right?
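To make this concrete, here is a minimal sketch of such a multi-armed bandit in Python; the Gaussian reward distributions and their parameters are illustrative assumptions, not values from the lecture.

```python
import numpy as np

class MultiArmedBandit:
    """A row of one-armed bandits; each arm has its own hidden
    reward distribution (here: Gaussian, an illustrative choice)."""

    def __init__(self, n_arms=5, seed=0):
        self.rng = np.random.default_rng(seed)
        # Hidden per-arm mean rewards -- unknown to the player.
        self.means = self.rng.normal(loc=0.0, scale=1.0, size=n_arms)

    def pull(self, action):
        """Pull the lever of arm `action` and sample an immediate reward r."""
        return self.rng.normal(loc=self.means[action], scale=1.0)

bandit = MultiArmedBandit()
r = bandit.pull(2)  # immediate reward from machine 2
```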
Okay, so because we now have actions and we have rewards, we somehow have to choose an action.
And the choosing of an action we can formalize as a policy.
So these are the important concepts.
They will also be important for our reinforcement learning.
So we choose an action, the action generates a reward, and we choose the action by some policy.
The policy we denote here with pi, and we can also model it as a kind of probability distribution over the actions.
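The lecture has not fixed a concrete policy at this point; as an illustration, a common choice for the bandit setting is the epsilon-greedy policy, which mostly exploits the arm with the highest estimated reward and explores a random arm with probability epsilon. A minimal sketch, reusing the MultiArmedBandit class from the sketch above:

```python
import numpy as np

def epsilon_greedy(bandit, n_steps=1000, epsilon=0.1, seed=1):
    """Epsilon-greedy policy pi: explore a random arm with
    probability epsilon, otherwise exploit the current best estimate."""
    rng = np.random.default_rng(seed)
    n_arms = len(bandit.means)
    q = np.zeros(n_arms)   # running estimate of each arm's expected reward
    n = np.zeros(n_arms)   # how often each arm has been pulled
    total_reward = 0.0
    for _ in range(n_steps):
        if rng.random() < epsilon:
            a = int(rng.integers(n_arms))   # explore: random arm
        else:
            a = int(np.argmax(q))           # exploit: best estimate so far
        r = bandit.pull(a)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]           # incremental mean update
        total_reward += r
    return q, total_reward
```

With epsilon set to zero, the policy is purely greedy and can get stuck on a suboptimal machine; the small amount of random exploration is what lets it discover the arm with the highest mean reward.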